We consider data from the US Energy Information Administration’s online data browser.The data was downloaded from Table 2.2 of the US Energy Information Administration’s online data browser (https://www.eia.gov/totalenergy/data/browser/). The data set contains multiple time series, detailing:
The aim of this study is to analyse the monthly residential energy consumption in the US, since January 2000.We will consider two cases:
The data includes 7,896 observations for each month of each year since 1949 with 6 columns as it can be seen in the table below.
| MSN | YYYYMM | Value | Column_Order | Description | Unit |
|---|---|---|---|---|---|
| CLRCBUS | 194913 | 1271.551 | 1 | Coal Consumed by the Residential Sector | Trillion Btu |
| CLRCBUS | 195013 | 1261.267 | 1 | Coal Consumed by the Residential Sector | Trillion Btu |
| CLRCBUS | 195113 | 1158.679 | 1 | Coal Consumed by the Residential Sector | Trillion Btu |
| CLRCBUS | 195213 | 1079.206 | 1 | Coal Consumed by the Residential Sector | Trillion Btu |
| CLRCBUS | 195313 | 965.664 | 1 | Coal Consumed by the Residential Sector | Trillion Btu |
In particular:
Before processing the data, we extract the abbreviations (MSN) and descriptions(Description), and we divide them into two data frames, one containing all individual energy consumption per source and one containing the total values. Those will be later used for the visualizations. The two tables can be seen below:
##Extract codes and names
code<-data.frame(Detail=unique(orig_dat$MSN),Desc=unique(orig_dat$Description))
##Individual Values
ind<-code %>%
filter(!grepl('Total', Desc))
##Total Values
tot<-code %>%
filter(grepl('Total', Desc))
Next, we look into the data type of each column.
str(orig_dat)
## spec_tbl_df [7,896 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ MSN : chr [1:7896] "CLRCBUS" "CLRCBUS" "CLRCBUS" "CLRCBUS" ...
## $ YYYYMM : num [1:7896] 194913 195013 195113 195213 195313 ...
## $ Value : chr [1:7896] "1271.551" "1261.267" "1158.679" "1079.206" ...
## $ Column_Order: num [1:7896] 1 1 1 1 1 1 1 1 1 1 ...
## $ Description : chr [1:7896] "Coal Consumed by the Residential Sector" "Coal Consumed by the Residential Sector" "Coal Consumed by the Residential Sector" "Coal Consumed by the Residential Sector" ...
## $ Unit : chr [1:7896] "Trillion Btu" "Trillion Btu" "Trillion Btu" "Trillion Btu" ...
## - attr(*, "spec")=
## .. cols(
## .. MSN = col_character(),
## .. YYYYMM = col_double(),
## .. Value = col_character(),
## .. Column_Order = col_double(),
## .. Description = col_character(),
## .. Unit = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
unique(orig_dat$Unit)
## [1] "Trillion Btu"
We observe that:
We convert the original matrix into a data frame and drop the columns we mentioned above.We also convert the values that are set to “Not Available” to NA.
dat<-data.frame(orig_dat)
#Drop columns
dat<-subset(orig_dat, select = -c(Unit,Description,Column_Order))
###Missing values
dat$Value[dat$Value=='Not Available']<-NA
The column containing the dates is set to the numeric data type. We convert it to characters, extract the year and month, and finally drop the original column. Next, we drop the rows containing data from 1949 to 1999. Finally, we observe that there is a month 13, which corresponds to the total energy consumption for each year.
##Convert to character
dat$YYYYMM<-as.character(dat$YYYYMM)
##Create new column containing the dates
dat <- dat %>%
add_column(year = substr(dat$YYYYMM, start = 1, stop = 4),.after = "YYYYMM")
dat <- dat %>%
add_column(mon = substr(dat$YYYYMM, start = 5, stop = 6),.after = "year")
##Drop original column
dat <- subset(dat, select = -YYYYMM )
## Drop rows with a date before january 2000
ec<-subset(dat,year>=2000)
## Drop rows with month 13 which corresponds to total
ec<-subset(ec,mon!=13)
Finally, we convert all columns to the correct data types.
#Combine dates in the right format
ec$year<-as.numeric(ec$year)
ec$mon<-as.numeric(ec$mon)
ec$Value<-as.numeric(ec$Value)
ec$Date <- as.yearmon(paste(ec$year, ec$mon), "%Y %m")
The final data frame can be seen below:
| MSN | year | mon | Value | Date |
|---|---|---|---|---|
| CLRCBUS | 2000 | 1 | 1.466 | Jan 2000 |
| CLRCBUS | 2000 | 2 | 1.093 | Feb 2000 |
| CLRCBUS | 2000 | 3 | 0.848 | Mar 2000 |
| CLRCBUS | 2000 | 4 | 0.967 | Apr 2000 |
| CLRCBUS | 2000 | 5 | 0.648 | May 2000 |
Note: The last column may appear redundant, since we already have the year and month in the previous columns. We keep it as the format is easier for visualisation purposes.
There are 166 missing values, as seen in the figure and table below. Those values are only missing from the data containing coal energy consumption. To make the visualization process easier, we dropped such values rather of using them.
#Find number a missing values per group
gg_miss_fct(ec, MSN)
ec %>% group_by(MSN) %>% summarise(
non_na = sum(!is.na(Value)),na = sum(is.na(Value)))
## # A tibble: 12 × 3
## MSN non_na na
## <chr> <int> <int>
## 1 BMRCBUS 262 0
## 2 CLRCBUS 96 166
## 3 ESRCBUS 262 0
## 4 FFRCBUS 262 0
## 5 GERCBUS 262 0
## 6 LORCBUS 262 0
## 7 NNRCBUS 262 0
## 8 PARCBUS 262 0
## 9 RERCBUS 262 0
## 10 SORCBUS 262 0
## 11 TERCBUS 262 0
## 12 TXRCBUS 262 0
#Delete NAs and get final data frame
f.data <- na.omit(ec)
Let’s start by visualizing the energy consumption for each energy
source.
It’s difficult to detect any distinct trends from the above graph. We
observe natural gas is the most used energy source with significant
variations between the winter and summer seasons. Additionally, we can
observe that although there has been a little increasing tendency in
electricity sales over time, volatility has remained constant. We’re
going to visualize individually the consumption for fossil fuels and
renewable energy.
Next to better see the amount of energy used by each source, we will create a pie chart. The tables created in the data cleaning phase can be used to interpret the abbreviations.
Finally, we’re going to visualise the mean energy consumed per individual energy source.
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## Date BMRCBUS CLRCBUS ESRCBUS
## Min. :2000 Min. :28.38 Min. :0.3820 Min. :260.1
## 1st Qu.:2002 1st Qu.:32.26 1st Qu.:0.6240 1st Qu.:315.7
## Median :2004 Median :33.61 Median :0.7730 Median :361.1
## Mean :2004 Mean :33.44 Mean :0.8529 Mean :367.1
## 3rd Qu.:2006 3rd Qu.:35.34 3rd Qu.:1.0283 3rd Qu.:410.3
## Max. :2008 Max. :36.52 Max. :1.7910 Max. :512.1
## GERCBUS LORCBUS NNRCBUS PARCBUS
## Min. :0.681 Min. : 604.5 Min. : 111.1 Min. : 54.93
## 1st Qu.:0.803 1st Qu.: 691.3 1st Qu.: 142.9 1st Qu.: 82.13
## Median :1.107 Median : 769.5 Median : 312.4 Median :107.41
## Mean :1.161 Mean : 804.0 Mean : 410.9 Mean :120.08
## 3rd Qu.:1.363 3rd Qu.: 892.7 3rd Qu.: 692.9 3rd Qu.:156.50
## Max. :1.868 Max. :1129.9 Max. :1001.5 Max. :226.45
## SORCBUS
## Min. :2.718
## 1st Qu.:3.648
## Median :4.649
## Mean :4.446
## 3rd Qu.:5.190
## Max. :5.984
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
## Date BMRCBUS CLRCBUS ESRCBUS GERCBUS LORCBUS
## NA 1.9651206 0.3333607 61.7581942 0.3657290 136.0019233
## NNRCBUS PARCBUS SORCBUS
## 286.4640758 45.3295816 0.9139452
## # A tibble: 1,930 × 3
## # Groups: MSN [8]
## MSN Value Date
## <chr> <dbl> <yearmon>
## 1 CLRCBUS 1.47 Jan 2000
## 2 CLRCBUS 1.09 Feb 2000
## 3 CLRCBUS 0.848 Mar 2000
## 4 CLRCBUS 0.967 Apr 2000
## 5 CLRCBUS 0.648 May 2000
## 6 CLRCBUS 0.656 Jun 2000
## 7 CLRCBUS 0.792 Jul 2000
## 8 CLRCBUS 0.81 Aug 2000
## 9 CLRCBUS 0.67 Sep 2000
## 10 CLRCBUS 0.531 Oct 2000
## # … with 1,920 more rows
We are going to calculate the summary data for the total consumption and sales.The graphs below show the total energy consumed, the energy consumed in the recent years and the proportion of each one.
## FFRCBUS RERCBUS TERCBUS TXRCBUS
## Min. : 134.5 Min. :32.33 Min. :1271 Min. : 192.3
## 1st Qu.: 202.5 1st Qu.:40.47 1st Qu.:1434 1st Qu.: 247.6
## Median : 354.4 Median :49.62 Median :1634 Median : 403.6
## Mean : 503.3 Mean :50.58 Mean :1743 Mean : 553.9
## 3rd Qu.: 813.7 3rd Qu.:58.68 3rd Qu.:1972 3rd Qu.: 863.6
## Max. :1224.4 Max. :76.88 Max. :2808 Max. :1272.0
| MSN | year | mon | Value | Date |
|---|---|---|---|---|
| NNRCBUS | 2000 | 1 | 881.954 | Jan 2000 |
| NNRCBUS | 2000 | 2 | 792.410 | Feb 2000 |
| NNRCBUS | 2000 | 3 | 562.305 | Mar 2000 |
| NNRCBUS | 2000 | 4 | 409.652 | Apr 2000 |
| NNRCBUS | 2000 | 5 | 233.428 | May 2000 |
In the figure below, it can be observed that the ACF is sinusoidal, which implies that the ACF is periodic. The peak recurs every 12 implies that the ACF has a period of 12 lags which when calculated is equivalent to a year.
The ACF is not approaching zero and thus it is most probably not stationary.It can also be observed that the majority of the ACF values are not within the bounds of statistical significance.
On the figure below, we observe that there is a slight upward trend.
For the multivariate analysis we’re choosing to study the total fossil and renewable energy consumption.For this purpose we create a new dataframe containing the variables of interest and then we reshape it in order to make our analysis easier.
We can see from the below graphic that there is a significant disparity between the usage of fossil fuels and renewable energy sources. We will examine each one in more detail in the sections that follow.